1  Abstract


The demand for streaming multimedia applications is growing at a fast rate. In this report, we
present Bayeux, an efficient application-level multicast system that scales to arbitrarily large
receiver groups while tolerating failures in routers and network links. Bayeux also includes specific
mechanisms for load-balancing across replicated root nodes and more efficient bandwidth consumption.
Our simulation results indicate that Bayeux maintains these properties while keeping transmission
overhead low (i.e., overlay routing latency is only 2-3 times the physical shortest-path
latency, and redundant packet duplication is an 85-fold improvement over naive unicast). To achieve
these properties, Bayeux leverages the architecture of Tapestry, a fault-tolerant, wide-area overlay
routing and location network.


2  Introduction 


The demand for streaming multimedia applications is growing at an incredible rate. Such applications
are distinguished by a single writer (or small number of writers) simultaneously feeding
information to a large number of readers. Current trends indicate a need to scale to thousands or
millions of receivers. To say that such applications stress the capabilities of wide-area networks is
an understatement. When millions of receiving nodes are involved, unicast is completely impractical
because of its redundant use of link bandwidth; to best utilize network resources, receivers must
be arranged in efficient communication trees. This in turn requires the efficient coordination of a
large number of individual components, leading to a concomitant need for resilience to node and
link failures.

Given barriers to wide-spread deployment of IP multicast, researchers have turned to application-level
solutions. The major challenge is to build an efficient network of unicast connections and to
construct data distribution trees on top of this overlay structure. Currently, there are no designs for
application-level multicast protocols that scale to thousands of members, incur both minimal delay
and bandwidth penalties, and handle faults in both links and routing nodes.

In  this  report  we  present  Bayeux,  an  efficient,  source-specific,  explicit-join,  application-level 
multicast  system  that  has  these  properties.  One  of  the  novel  aspects  of  Bayeux  is  that  it  combines 
randomness  for  load  balancing  with  locality  for  efficient  use  of  network  bandwidth.  Bayeux  utilizes 
a  prefix-based  routing  scheme  that  it  inherits  from  an  existing  application-level  routing  protocol 
called  Tapestry  [36],  a  wide-area  location  and  routing  architecture  used  in  the  OceanStore  [16] 
globally  distributed  storage  system.  On  top  of  Tapestry,  Bayeux  provides  a  simple  protocol  that 
organizes  the  multicast  receivers  into  a  distribution  tree  rooted  at  the  source.  Simulation  results 
indicate  that  Bayeux  scales  well  beyond  thousands  of  multicast  nodes  in  terms  of  overlay  latency 
and  redundant  packet  duplication,  for  a  variety  of  topology  models. 

In addition to the base multicast architecture, Bayeux leverages the Tapestry infrastructure to
provide simple load-balancing across replicated root nodes, as well as reduced bandwidth consumption,
by clustering receivers by identifier. The benefits of these optimizing mechanisms are shown
in simulation results. Finally, Bayeux provides a variety of protocols to leverage the redundant routing
structure of Tapestry. We evaluate one of them, First Reachable Link Selection, and show it to
provide near-optimal fault-resilient packet delivery to reachable destinations, while incurring a low
overhead in terms of membership state management.

In  the  rest  of  this  report  we  discuss  the  architecture  of  Bayeux  and  provide  simulation  results. 
First,  Section  3  describes  the  Tapestry  routing  and  location  infrastructure.  Next,  Section  4  describes 
the  Bayeux  architecture,  followed  by  Section  5  which  evaluates  it.  In  Section  6,  we  explore  novel 
scalability  optimizations  in  Bayeux,  followed  by  fault-resilient  packet  delivery  in  Section  7.  We 
discuss  related  work  in  Section  8.  Finally,  we  discuss  future  work  and  conclude  in  Section  10. 


3  Tapestry  Routing  and  Location 


Our architecture leverages Tapestry, an overlay location and routing layer presented by Zhao, Kubiatowicz
and Joseph in [36]. Bayeux uses the natural hierarchy of Tapestry routing to forward
packets  while  conserving  bandwidth.  Multicast  group  members  wishing  to  participate  in  a  Bayeux 
session  become  (if  not  already)  Tapestry  nodes,  and  a  data  distribution  tree  is  built  on  top  of  this 
overlay  structure. 

The  Tapestry  location  and  routing  infrastructure  uses  similar  mechanisms  to  the  hashed-suffix 
mesh  introduced  by  Plaxton,  Rajaraman  and  Richa  in  [21].  It  is  novel  in  allowing  messages  to  locate 
objects  and  route  to  them  across  an  arbitrarily-sized  network,  while  using  a  routing  map  with  size 
logarithmic  to  the  network  namespace  at  each  hop.  Tapestry  provides  a  delivery  time  within  a  small 
factor  of  the  optimal  delivery  time,  from  any  point  in  the  network.  A  detailed  discussion  of  Tapestry 
algorithms,  its  fault-tolerant  mechanisms  and  simulation  results  can  be  found  in  [36]. 

Each Tapestry node or machine can take on the roles of server (where objects are stored),
router (which forwards messages), and client (the origin of requests). Also, objects and nodes have
names independent of their location and semantic properties, in the form of random fixed-length
bit-sequences represented by a common base (e.g., 40 hex digits representing 160 bits). The system
assumes entries are roughly evenly distributed in both node and object namespaces, which can
be achieved by using the output of secure one-way hashing algorithms, such as SHA-1 [25].

Figure 1: Tapestry routing example. Here we see the path taken by a message originating from node
0325 destined for node 4598 in a Tapestry network using hexadecimal digits of length 4 (65536
nodes in namespace).

3.1  Routing  Layer 

Tapestry uses local routing maps at each node, called neighbor maps, to incrementally route overlay
messages to the destination ID digit by digit (e.g., ***8 → **98 → *598 → 4598, where
*’s represent wildcards). This approach is similar to longest prefix routing in the CIDR IP address
allocation architecture [24]. A node N has a neighbor map with multiple levels, where each level
represents a matching suffix up to a digit position in the ID. A given level of the neighbor map
contains a number of entries equal to the base of the ID, where the i-th entry in the j-th level is the
ID and location of the closest node which ends in “i” + suffix(N, j - 1). For example, the 9th entry
of the 4th level for node 325AE is the node closest to 325AE in network distance which ends in
95AE.
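
The entry rule above can be sketched in a few lines. This is only an illustration under our own assumptions: the function name `neighbor_entry_suffix` is hypothetical, IDs are treated as plain strings of digits, and levels are counted from 1 as in the text.

```python
def neighbor_entry_suffix(node_id: str, level: int, digit: str) -> str:
    """Suffix that the `digit` entry at `level` of node_id's neighbor map must match.

    Implements "i" + suffix(N, j - 1): the entry's digit followed by the last
    (level - 1) digits of the local node ID. Hypothetical helper, not Tapestry code.
    """
    if not 1 <= level <= len(node_id):
        raise ValueError("level must be between 1 and the ID length")
    # suffix(N, j - 1): the last (level - 1) digits of the node's own ID.
    shared = node_id[len(node_id) - (level - 1):] if level > 1 else ""
    return digit + shared


# The example from the text: 9th entry of the 4th level for node 325AE.
example = neighbor_entry_suffix("325AE", 4, "9")
```

Here `example` is the suffix that the chosen neighbor must end in; picking the closest such node in network distance is left out of the sketch.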

When routing, the n-th hop shares a suffix of at least length n with the destination ID. To find
the next router, we look at its (n + 1)-th level map, and look up the entry matching the value of the
next digit in the destination ID. Assuming consistent neighbor maps, this routing method guarantees
that any existing unique node in the system will be found within at most $\log_b N$ logical hops, in
a system with a namespace of size N using IDs of base b. Because every single neighbor map at a
node assumes that the preceding digits all match the current node’s suffix, it only needs to keep a
small constant number (b) of entries at each route level, yielding a neighbor map of fixed constant size
$b \cdot \log_b N$.
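
As a rough illustration of the bounds above, the following sketch lists the suffix patterns a message resolves on its way to a destination and computes the hop bound and neighbor map size for a given namespace. The helper names are our own and are not part of Tapestry.

```python
def route_patterns(dest_id: str) -> list[str]:
    """Suffix patterns resolved hop by hop, e.g. ***8, **98, *598, 4598."""
    k = len(dest_id)
    return ["*" * (k - n) + dest_id[k - n:] for n in range(1, k + 1)]


def id_digits(namespace_size: int, base: int) -> int:
    """Number of base-`base` digits needed to name `namespace_size` nodes.

    Equals log_b(N) when N is an exact power of b; integer arithmetic avoids
    floating-point rounding.
    """
    digits, capacity = 0, 1
    while capacity < namespace_size:
        capacity *= base
        digits += 1
    return digits


def neighbor_map_entries(namespace_size: int, base: int) -> int:
    """b entries per level times log_b(N) levels."""
    return base * id_digits(namespace_size, base)


patterns = route_patterns("4598")
max_hops = id_digits(65536, 16)                # 4-digit hexadecimal namespace
map_entries = neighbor_map_entries(65536, 16)
```

For the 65536-node hexadecimal namespace of Figure 1, this gives at most 4 logical hops and 64 primary entries per neighbor map.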


A  way  to  visualize  this  routing  mechanism  is  that  every  destination  node  is  the  root  node  of 
its  own  tree,  which  is  a  unique  spanning  tree  across  all  nodes.  Any  leaf  can  traverse  a  number  of 
intermediate  nodes  en  route  to  the  root  node.  In  short,  the  hashed-suffix  mesh  of  neighbor  maps  is 
a  large  set  of  embedded  trees  in  the  network,  one  rooted  at  every  node.  Figure  1  shows  an  example 
of  hashed-suffix  routing. 

In addition to providing a scalable routing mechanism, Tapestry also provides a set of fault-tolerance
mechanisms which allow routers to quickly route around link and node failures. Each
entry in the neighbor map actually contains three entries that match the given suffix, where two
secondary pointers are available if and when the primary route fails. These redundant routing paths
are utilized by Bayeux protocols in Section 7.
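
One simple way to use the three pointers is to try them in order and take the first that responds. The sketch below assumes a caller-supplied reachability check; it illustrates the general idea only and is not claimed to match the specific protocols of Section 7.

```python
from typing import Callable, Optional, Sequence


def select_next_hop(entry: Sequence[str],
                    is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the first reachable neighbor in a neighbor-map entry.

    `entry` is ordered as [primary, secondary, secondary]; `is_reachable`
    is a hypothetical liveness probe supplied by the caller.
    """
    for neighbor in entry:
        if is_reachable(neighbor):
            return neighbor
    return None  # all three pointers have failed


# Hypothetical entry for suffix *598 with the primary route down.
failed = {"7598"}
chosen = select_next_hop(["7598", "A598", "2598"], lambda n: n not in failed)
```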


3.2  Data  Location 

Tapestry employs this infrastructure for data location in a straightforward way. Each object is associated
with one or more Tapestry location roots through a distributed deterministic mapping function.
To advertise or publish an object O, the server S storing the object sends a publish message toward
the Tapestry location root for that object. At each hop along the way, the publish message stores
location information in the form of a mapping <Object-ID(O), Server-ID(S)>. Note that these
mappings are simply pointers to the server S where O is being stored, and not a copy of the object
itself. Where multiple replicas of an object exist, each server maintaining a replica publishes its copy.
A node N that keeps location mappings for multiple replicas keeps them sorted in order of distance from N.

During a location query, clients send messages directly to objects via Tapestry. A message
destined for O is initially routed towards O’s root from the client. At each hop, if the message
encounters a node that contains the location mapping for O, it is redirected to the server containing
the object. Otherwise, the message is forwarded one step closer to the root. If the message reaches
the root, it is guaranteed to find a mapping for the location of O. Note that the hierarchical nature
of Tapestry routing means at each hop towards the root, the number of nodes satisfying the next
hop constraint decreases by a factor equal to the identifier base (e.g., octal or hexadecimal) used
in Tapestry. For nearby objects, client search messages quickly intersect the path taken by publish
messages, resulting in quick search results that exploit locality. Furthermore, by sorting distance
to multiple replicas at intermediate hops, clients are likely to find the nearest replica of the desired
object. These properties are analyzed and discussed in more detail in [36].
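
The publish and query steps can be sketched as follows. This is a simplified model under stated assumptions: the overlay paths toward the root are passed in explicitly rather than computed by Tapestry routing, the class and node names are hypothetical, and the distance-sorting of replicas is omitted.

```python
from collections import defaultdict


class LocationDirectory:
    """Toy model of Tapestry location pointers (illustrative only)."""

    def __init__(self):
        # node -> {object_id: [server_id, ...]}; pointers only, never object copies.
        self.pointers = defaultdict(dict)

    def publish(self, object_id, server_id, path_to_root):
        """Leave an <object, server> mapping at every hop toward the root."""
        for node in path_to_root:
            self.pointers[node].setdefault(object_id, []).append(server_id)

    def locate(self, object_id, path_to_root):
        """Walk toward the root; stop at the first hop holding a mapping.

        Returns (server_id, hops_taken), or (None, len(path)) if nothing is found.
        """
        for hops, node in enumerate(path_to_root):
            servers = self.pointers[node].get(object_id)
            if servers:
                return servers[0], hops
        return None, len(path_to_root)


directory = LocationDirectory()
directory.publish("O", "S", ["S", "n1", "n2", "root"])
# The client's path meets the publish path at n2, before reaching the root.
found = directory.locate("O", ["C", "n2", "root"])
```

The query is redirected as soon as it intersects the publish path, which is the locality property described above.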


3.3  Benefits 

Tapestry provides the following benefits:

•  Powerful Fault Handling: Tapestry provides multiple paths to every destination. This mechanism
enables application-specific protocols for fast failover and recovery.

•  Scalable: Tapestry routing is inherently decentralized, and all routing is done using information
from a number of nodes logarithmically proportional to the size of the network. Routing
tables also have size logarithmically proportional to the network size, guaranteeing scalability
as the network scales.

•  Proportional Route Distance: It follows from Plaxton et al.’s proof in [21] that the network
distance traveled by a message during routing is linearly proportional to the real underlying
network distance, assuring us that routing on the Tapestry overlay incurs a reasonable overhead.
In fact, experiments have shown this proportionality is maintained with a small constant
in real networks [36].


3.4  Multicast  on  Tapestry 

The nature of Tapestry unicast routing provides a natural ground for building an application-level
multicasting system. Tapestry overlay assists efficient multi-point data delivery by forwarding packets
according to suffixes of listener node IDs. The node ID base defines the fanout factor used in
the multiplexing of data packets to different paths on each router. Because randomized node IDs
naturally group themselves into sets sharing common suffixes, we can use that common suffix to
minimize transmission of duplicate packets. A multicast packet only needs to be duplicated when
the  receiver  node  identifiers  become  divergent  in  the  next  digit.  In  addition,  the  maximum  number 
of  overlay  hops  taken  by  such  a  delivery  mechanism  is  bounded  by  the  total  number  of  digits  in 
the  Tapestry  node  IDs.  For  example,  in  a  Tapestry  namespace  size  of  4096  with  an  octal  base,  the 
maximum  number  of  overlay  hops  from  a  source  to  a  receiver  is  4.  The  amount  of  packet  fan-out  at 
each  branch  point  is  limited  to  the  node  ID  base.  This  fact  hints  at  a  natural  multicast  mechanism 
on  the  Tapestry  infrastructure. 
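
To make the duplication rule concrete, the sketch below groups a set of receivers by shared suffix. The receiver IDs are made up for illustration, and `group_by_suffix` is a hypothetical helper; one packet copy per group suffices at each level, so copies multiply only where the IDs diverge.

```python
from collections import defaultdict


def group_by_suffix(receiver_ids, suffix_len):
    """Group receiver IDs by their last `suffix_len` digits."""
    groups = defaultdict(list)
    for rid in receiver_ids:
        groups[rid[-suffix_len:]].append(rid)
    return dict(groups)


# Hypothetical 4-digit receiver IDs.
receivers = ["1250", "3250", "7350", "4621"]

level1 = group_by_suffix(receivers, 1)   # split on the last digit
level3 = group_by_suffix(receivers, 3)   # split on the last three digits
copies_per_level = [len(group_by_suffix(receivers, n)) for n in range(1, 5)]
```

In this example two copies leave the first level, and receivers 1250 and 3250 share a single copy until their fourth digit diverges.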

Note  that  unlike  most  existing  application  level  multicast  systems,  not  all  nodes  of  the  Tapestry 
overlay  network  are  Bayeux  multicast  receivers.  This  use  of  dedicated  infrastructure  server  nodes 
provides  better  optimization  of  the  multicast  tree  and  is  a  unique  feature  of  the  Bayeux/Tapestry 
system. 


4  Bayeux  Base  Architecture 


Bayeux  provides  a  source-specific,  explicit-join  multicast  service.  The  source-specific  model  has 
numerous  practical  advantages  and  is  advocated  by  a  number  of  projects  [13,  31,  33,  35].  A  Bayeux 
multicast session is identified by the tuple <session name, UID>. A session name is a semantic
name describing the content of the multicast, and the UID is a distinguishing ID that uniquely
identifies a particular instance of the session.


4.1  Session  Advertisement 

We utilize Tapestry’s data location services to advertise Bayeux multicast sessions. To announce a
session, we take the tuple that uniquely names a multicast session, and use a secure one-way hashing
function (such as SHA-1 [25]) to map it into a 160 bit identifier. We then create a trivial file named
with that identifier and place it on the multicast session’s root node.

Figure 2: Tree maintenance

Using Tapestry location services, the root or source server of a session advertises that document
into the network. Clients that want to join a session must know the unique tuple that identifies that
session. They can then perform the same operations to generate the file name, and query for it using
Tapestry. These searches result in the session root node receiving a message from each interested
listener, allowing it to perform the required membership operations. As we will see in Section 6.1,
this session advertisement scheme allows root replication in a way that is transparent to the multicast
listeners.
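
The identifier derivation shared by the root and by joining clients might look like the sketch below. The text only specifies a secure one-way hash such as SHA-1 over the session tuple; the way the two fields are serialized here (UTF-8 with a NUL separator) and the example session values are our own assumptions.

```python
import hashlib


def session_identifier(session_name: str, uid: str) -> str:
    """Map a <session name, UID> tuple to a 160-bit ID as 40 hex digits.

    The NUL-separated UTF-8 encoding is an assumed serialization, chosen so
    that distinct tuples cannot collide by concatenation.
    """
    key = f"{session_name}\x00{uid}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()


# Root and client compute the same name independently from the same tuple.
root_side = session_identifier("example-lecture", "0001")
client_side = session_identifier("example-lecture", "0001")
```

Because both sides apply the same function, a client that knows the tuple can query Tapestry for the file name without any prior contact with the root.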


4.2  Tree  Maintenance 

Constructing an efficient and robust distribution tree to deliver data to session members is the key to
efficient operation in application-level multicast systems. Unlike most existing work in this space,
Bayeux utilizes dedicated servers in the network infrastructure (in the form of Tapestry nodes) to
help construct more efficient data distribution trees.

There are four types of control messages in building a distribution tree: JOIN, LEAVE,
TREE, PRUNE. A member joins the multicast session by sending a JOIN message towards the
root, which then replies with a TREE message. Figure 2 shows an example where node 7876 is the
root of a multicast session, and node 1250 tries to join. The JOIN message from node 1250 traverses
nodes xxx6, xx76, x876, and 7876 via Tapestry unicast routing, where xxx6 denotes
some node that ends with 6. The root 7876 then sends a TREE message towards the new member,
which sets up the forwarding state at intermediate application-level routers. Note that while
both control messages are delivered by unicasting over the Tapestry overlay network, the JOIN and
TREE paths might be different, due to the asymmetric nature of Tapestry unicast routing.

When a router receives a TREE message, it adds the new member node ID to the list of receiver
node IDs that it is responsible for, and updates its forwarding table. For example, consider node
xx50 on the path from the root node to node 1250. Upon receiving the TREE message from
the root, node xx50 will add 1250 into its receiver ID list, and will duplicate and forward future
packets for this session to node x250. Similarly, a LEAVE message from an existing member
triggers a PRUNE message from the root, which trims from the distribution tree any routers whose
forwarding states become empty after the leave operation.
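
The per-router state change on TREE and PRUNE can be sketched as below. This is a minimal model with hypothetical class and method names; message transport, the JOIN and LEAVE legs to the root, and per-session bookkeeping are all omitted.

```python
class BayeuxRouter:
    """Forwarding state kept by one application-level router (illustrative)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.receivers = {}  # member node ID -> next overlay hop toward it

    def on_tree(self, member_id, next_hop):
        """TREE: record the new member and the hop used to reach it."""
        self.receivers[member_id] = next_hop

    def on_prune(self, member_id):
        """PRUNE: drop the member; True means this router can be trimmed."""
        self.receivers.pop(member_id, None)
        return not self.receivers

    def forward_targets(self):
        """Distinct next hops; one packet copy is sent to each."""
        return sorted(set(self.receivers.values()))


# The xx50 router from the example, plus a second hypothetical member 3250.
router = BayeuxRouter("xx50")
router.on_tree("1250", "x250")
router.on_tree("3250", "x250")
targets = router.forward_targets()
still_needed = not router.on_prune("1250")
trimmed = router.on_prune("3250")
```

Both members sit behind x250, so the router forwards a single copy; it is trimmed only once its receiver list is empty.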

5  Evaluation  of  Base  Design


Here, we compare the basic Bayeux algorithm against IP multicast and naive unicast. By naive
unicast we mean a unicast star topology rooted at the source that performs one-to-one transmission
to all receivers.


5.1  Simulation  Setup 

To evaluate our protocol, we implemented Tapestry unicast routing and the Bayeux tree protocol as
a packet-level simulator. Our measurements focus on distance and bandwidth metrics, and do not
model the effects of any cross traffic or router queuing delays.

We use the Stanford Graph Base library [30] to access four different topologies in our simulations
(AS, MBone, GT-ITM and TIERS). The AS topology shows connectivity between Internet
autonomous systems (AS), where each node in the graph represents an AS as measured by the National
Laboratory for Applied Network Research [18] based on BGP routing tables. The MBone
graph presents the topology of the MBone as collected by the SCAN project at USC/ISI [28] in
February 1999. To measure our metrics on larger networks, we turned to the GT-ITM [12] package,
which produces transit-stub style topologies, and the TIERS [34] package, which constructs
topologies by categorizing routers into LAN, MAN, and WAN routers. In our experiments, unicast
distances are measured as the shortest path distance between any two multicast members.


5.2  Performance  Metrics 

We adopt the two metrics proposed in [6] to evaluate the effectiveness of our application-level
multicast technique:

•  Relative Delay Penalty (RDP), a measure of the increase in delay that applications incur while using
overlay routing. For Bayeux, it is the ratio of Tapestry unicast routing distances to IP unicast
routing distances. Assuming symmetric routing, IP multicast and naive unicast both have an
RDP of 1.

•  Physical Link Stress, a measure of how effective Bayeux is in distributing network load across
different physical links. It refers to the number of identical copies of a packet carried by a
physical link. IP multicast has a stress of 1, and naive unicast has a worst case stress equal to
the number of receivers.
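
Both metrics are straightforward to compute once the overlay and physical paths are known. The sketch below is not the simulator used in this report; the function names and the tiny three-receiver star are made up for illustration.

```python
from collections import Counter


def relative_delay_penalty(overlay_distance, ip_distance):
    """Ratio of overlay routing distance to IP unicast distance."""
    return overlay_distance / ip_distance


def link_stress(physical_paths):
    """Count packet copies per physical link.

    `physical_paths` holds one list of physical links for every copy of the
    packet that is sent (one per overlay hop, or one per receiver for naive
    unicast).
    """
    stress = Counter()
    for path in physical_paths:
        stress.update(path)
    return stress


# Naive unicast to three receivers that all sit behind the link (src, r1).
naive_unicast = [
    [("src", "r1"), ("r1", "a")],
    [("src", "r1"), ("r1", "b")],
    [("src", "r1"), ("r1", "c")],
]
stress = link_stress(naive_unicast)
worst_case = max(stress.values())
rdp = relative_delay_penalty(overlay_distance=6, ip_distance=2)
```

The shared first link carries one copy per receiver, which is the worst case noted above for naive unicast.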

5.3  Snapshot  Measurements 

In this experiment, we used a topology generated by the transit-stub model consisting of 50000
nodes, with a Tapestry overlay using node namespace size of 4096, ID base of 4, and a multicast
group size of 4096 members. RDP is measured for all pairwise connections between nodes in
the network. Figure 3 plots the cumulative distribution of RDP on this network. The horizontal
axis represents a particular RDP and the vertical axis represents the cumulative fraction of sender-receiver
pairs for which the RDP is less than this value. As we can see, the RDP for a large majority
of connections is quite low. In fact, about 90% of pairs of members have an RDP less than 4.

Figure 3: Cumulative distribution of RDP

Figure 4: Comparing number of stressed links between naive unicast and Bayeux using log scale
on both axes.

<BASE 4, NAMESPACE SIZE 4096, GROUP SIZE 4096, transit-stub 50000>

Figure 5: RDP vs. physical delay (horizontal axis: physical delay in hops)

A few sender-receiver pairs have a higher RDP; however, it can be seen in Figure 5 that the
maximum  RDP  of  seven  corresponds  to  a  sender-receiver  pair  with  a  small  physical  delay  of  five 
hops.  This  is  because  even  though  two  nodes  are  physically  close  to  each  other,  the  digit-by-digit 
nature  of  Tapestry  routing  still  produces  a  path  of  the  same  number  of  overlay  hops,  which  can 
result  in  higher  RDPs.  However,  the  overlay  delay  between  this  sender-receiver  pair  is  not  very 
high,  which  can  be  seen  from  Figure  6. 
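
For readers who want to reproduce this kind of summary from raw measurements, the sketch below computes a cumulative fraction and a nearest-rank percentile. The sample values are invented for illustration and are not data from our simulations, and the nearest-rank rule is an assumed convention.

```python
def cumulative_fraction(samples, threshold):
    """Fraction of samples strictly below `threshold` (one point on the CDF)."""
    return sum(1 for s in samples if s < threshold) / len(samples)


def percentile_nearest_rank(samples, p):
    """p-th percentile (integer p in 1..100) using the nearest-rank rule."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p*n/100
    return ordered[rank - 1]


# Made-up RDP samples for ten sender-receiver pairs.
rdp_samples = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.5, 3.9, 7.0]

below_four = cumulative_fraction(rdp_samples, 4)
p90 = percentile_nearest_rank(rdp_samples, 90)
```

A single outlier such as the 7.0 above leaves the 90th percentile untouched, which is why a high maximum RDP for one nearby pair does not affect the headline figure.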

In Figure 4, we compare the variation of physical link stress in Bayeux to that under naive
unicast. We define the stress value as the number of duplicate packets going across a single physical
link. We pick random source nodes with random receiver groups, and measure the worst stress value
of all links in the tree built. We plot the number of links suffering from a particular stress level on
the Y-axis, against the range of stress levels on the X-axis. We see that relative to unicast, the overall
distribution of link stress is substantially lower. In addition, naive unicast exhibits a much longer
tail, where certain links experience stress levels up to 4095, whereas the Bayeux measurement shows
no such outliers. This shows that Bayeux distributes the network load evenly across physical links,
even for large multicast groups. While End System Multicast [6] also exhibits low physical link
stress, it only scales to receiver groups of hundreds.

<BASE 4, NAMESPACE SIZE 4096, GROUP SIZE 4096, transit-stub 50000>


Figure  6:  Overlay  delay  vs.  physical  delay 

5.4  Effects  of  Tunable  Parameters  on  Performance 

In this section, we study the effects of varying four parameters (multicast group size, namespace size,
topology size, and base) on the performance of Bayeux. The namespace of Tapestry nodes is defined
by  fixed-length  bit  sequences  represented  by  a  common  base.  For  instance,  a  Tapestry  network 
can  support  4096  nodes  using  12  bit  identifiers  represented  as  3  hexadecimal  digits.  For  all  results 
in  the  following  sections,  each  data  point  is  obtained  by  conducting  10  independent  simulation 
experiments,  and  we  plot  the  mean  and  the  standard  deviation. 


5.4.1  Group  Size 

In this experiment, we use topologies from the AS, MBone, TIERS, and transit-stub models, a
Tapestry namespace size of 4096, and a base of 4. Figure 7 plots the 90th percentile RDP versus
increasing group size for these four topologies. All the curves are close to each other except the AS
topology, which shows slightly higher RDPs. This is because the connectivity of a topology directly
affects the properties of the Tapestry overlay network built on top of it. Consider the difference
between AS and MBone topologies. The MBone is composed of islands that can directly support
IP multicast, where the islands are linked by virtual point-to-point tunnels whose endpoints have
support for IP multicast. The MBone topology is a combination of mesh at the backbone and star
at each regional network; however, the connectivity in the mesh is manually configured and ad-hoc.
In contrast, the AS topology is more structured and much better connected, with an increasing number
of peering relationships in recent years. Therefore, more nodes have higher fanouts in the AS
topology, which means that there is plenty of freedom in choosing the optimal route in shortest
path unicast routing. However, unicast routing in Tapestry is somewhat constrained in the sense that
routes have to follow the destination node identifiers, and thus cannot fully leverage the choice of
routes offered by the underlying topology, which offers some intuition why the AS topology tends
to have higher RDPs. Now we look at the overall variation in RDP as the group size increases from
16 to 4096. Figure 7 shows that the 90th percentile RDP remained more or less constant, which is
expected because increasing the group size only increases the fanouts of branching points, but does
not increase the height of the Bayeux tree, thus not affecting the RDP.

<BASE 4, NAMESPACE SIZE 4096>

Figure 7: 90th percentile RDP vs. group size for topologies from four models

Next  we  study  the  effect  of  varying  group  size  on  worst  case  physical  link  stress.  We  only 
consider  the  generated  transit-stub  model  of  50000  nodes  because  the  results  are  skewed  in  other 
real  topologies  of  about  5000  nodes  since  the  multicast  session  density  becomes  too  high  for  a  group 
size  of  4096.  Figure  8  plots  the  variation  of  the  worst  case  physical  link  stress  for  the  transit-stub 
model.  The  worst  case  physical  link  stress  increases  sub-linearly  as  the  group  size  increases  from 
16  to  4096.  While  for  large  group  sizes  of  thousands,  worst  case  stress  may  be  higher,  it  is  still 
much  lower  than  naive  unicast. 


5.4.2  Namespace  Size 

In this experiment, we examine the effect of varying the size of the Tapestry network on the RDP
and the worst case physical link stress. We use the topologies from the AS, MBone, TIERS, and
transit-stub models, and a Tapestry base of 4. Because we are only interested in the variation of
performance with respect to namespace size, we use a multicast group of 64 members to decrease
the amount of simulation time. Figures 9 and 10 plot the variations on RDP and worst case stress
as the Tapestry namespace size increases from 64 to 4096. For all topologies, we see a slight
increase in the RDP and worst case stress. This is because we do not gain additional benefits by
adding more Tapestry nodes beyond the number of members in the multicast group. In fact, the
performance degrades because as the namespace size increases and the base is kept constant, a node
needs to traverse a longer overlay path in order to reach another node, which increases the end-to-end
latencies, and also causes unnecessary packet duplications. With respect to varying topologies,
we note from Figure 10 that the AS topology exhibits the lowest worst case stress. This is for the
same reason that the AS topology has a higher RDP than the other topologies. In other words,
because of the higher fanout of nodes in the AS topology, more links share the responsibility of
multicast forwarding such that the amount of load on each individual link becomes lower, attaining
a load balancing effect.

<BASE 4, NAMESPACE SIZE 4096>

Figure 8: Worst case physical link stress vs. group size for transit-stub 50000

<BASE 4, GROUP SIZE 64>

Figure 9: 90th percentile RDP vs. Tapestry network size for topologies

<BASE 4, GROUP SIZE 64>

Figure 10: Worst case physical link stress vs. Tapestry network size for topologies


5.4.3  Topology  Size 

In  this  section,  we  use  a  Tapestry  namespace  size  of  64,  a  base  of  4,  and  a  multicast  group  size  of  64. 
We  generate  topologies  from  the  transit-stub  model  of  sizes  varying  from  100  nodes  to  50000  nodes, 
and  evaluate  the  impact  on  Bayeux’s  performance.  Figure  11  plots  the  RDP  against  the  topology 
size,  and  Figure  12  plots  the  worst  case  stress  against  the  topology  size.  We  observe  that  both  the 
RDP  and  worst  case  stress  decrease  in  general  as  the  topology  size  increases.  This  is  because  in  a 
fixed  physical  space,  the  number  of  links  increases  as  the  topology  becomes  larger,  which  results  in 
better  routes  becoming  available. 
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<BASE  4,  NAMESPACE  SIZE  64,  GROUP  SIZE  64> 


Figure  1 1 :  90  percentile  RDP  vs.  topology  size  for  topologies  from  the  transit-stub  model 


<BASE  4,  NAMESPACE  SIZE  64,  GROUP  SIZE  64> 


Figure  12:  Worst  case  physical  link  stress  vs.  topology  size  for  topologies  from  the  transit-stub 
model 
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Figure  13:  90  percentile  RDP  vs.  base  for  topologies  from  the  four  models 
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Figure  14:  Worst  case  physical  link  stress  vs.  base  for  topologies  from  the  four  models 
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5.4.4  Base 


Finally, we study the effect of varying the Tapestry base, which determines the range of the
overlay fanout in the Bayeux tree. We consider topologies from the AS, MBone, TIERS, and
transit-stub models, a Tapestry namespace size of 4096, and a group size of 64. Figures 13 and 14
plot the variations in RDP and worst case stress as the Tapestry base increases. When the base
increases and the namespace is kept constant, the Bayeux tree height decreases, which causes RDP
to decrease. On the other hand, the overlay fanout increases as the base increases, which causes
physical link stress to increase because physical links near the branching nodes need to be shared
by an increasing number of overlay links.
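
The relationship between base and tree height can be illustrated with a small calculation. The
sketch below is not taken from the Bayeux simulator; the function name `id_digits` is hypothetical,
and it assumes that the number of digits in a Tapestry ID bounds the number of overlay hops (and
hence the tree height), while the base bounds the per-node overlay fanout.

```python
def id_digits(namespace_size: int, base: int) -> int:
    """Return how many base-`base` digits are needed to name `namespace_size` IDs."""
    digits, capacity = 0, 1
    while capacity < namespace_size:
        capacity *= base
        digits += 1
    return digits


# Namespace size 4096, as used in this experiment; the bases are illustrative.
for base in (2, 4, 8, 16, 64):
    print(f"base {base:2d}: at most {id_digits(4096, base)} overlay hops, fanout up to {base}")
```

With the namespace fixed at 4096, raising the base from 4 to 16 halves the digit count from 6 to 3,
which is the shorter-tree, wider-fanout trade-off described above.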


5.5  Summary  of  Results 

In this section, we summarize the evaluation results presented in earlier sections.

Across a range of topology models, Bayeux achieves a low RDP for a wide range of group
sizes. Figure 7 shows that the 90 percentile RDP remains more or less constant as the group size
increases from 16 to 4096.

In addition, Bayeux results in a low worst case stress for a wide range of group sizes. Figure 8
shows that the worst case stress increases sub-linearly as the group size increases from 16 to 4096
for the transit-stub model of 50000 nodes. While worst case stress may be higher for larger group
sizes, it is still much lower than unicast. For example, for a group of 4096 members, Bayeux
reduces worst case stress by a factor of 85 compared to unicast.


6  Scalability  Enhancements 


In this section, we demonstrate and evaluate optimizations in Bayeux for load-balancing and
increased efficiency in bandwidth usage. These enhancements, Tree Partitioning and Receiver
Clustering, leverage Tapestry-specific properties, and are unique to Bayeux.


6.1  Tree  Partitioning 

The source-specific service model has several drawbacks. First, the root of the multicast tree is a
scalability bottleneck, as well as a single point of failure. Unlike existing multicast protocols, the
non-symmetric routing in Bayeux implies that the root node must handle all join and leave
requests from session members. Second, only the session root node can send data in a source-specific
service model. Although the root can act as a reflector for supporting multiple senders [13], all
messages have to go through the root, and a network partition or root node failure will compromise
the entire group's ability to receive data.

Figure 15: Receivers self-configuring into Tree Partitions


To remove the root as a scalability bottleneck and point of failure, Bayeux includes a Tree
Partitioning mechanism that leverages the Tapestry location mechanism. The idea is to create multiple
root nodes, and partition receivers into disjoint membership sets, each containing receivers closest
to a local root in network distance. Receivers organize themselves into these sets as follows:

1.  Integrate  Bayeux  root  nodes  into  a  Tapestry  network. 

2.  Name  an  object  O  with  the  hash  of  the  multicast  session  name,  and  place  O  on  each  root. 

3.  Each  root  advertises  O  in  Tapestry,  storing  pointers  to  itself  at  intermediate  hops  between  it 
and  the  Tapestry  location  root,  a  node  deterministically  chosen  based  on  O. 

4. On JOIN, new member M uses Tapestry location services to find and route a JOIN message
to the nearest root node R.

5. R sends a TREE message to M, now a member of R's receiver set.

Figure 15 shows the path of various messages in the tree partitioning algorithm. Each member
M sends location requests up to the Tapestry location root. Tapestry location services guarantee
that M will find the closest such root with high probability [21, 36]. Root nodes then use Tapestry
routing to forward packets to downstream routers, minimizing packet duplication where possible.
The self-configuration of receivers into partitioned sets means root replication is an efficient tool for
balancing load between root nodes and reducing first hop latency to receivers when roots are placed
near listeners. Bayeux's technique of root replication is similar in principle to root replication
used by many existing IP multicast protocols such as CBT [3] and PIM [7, 8]. Unlike other root
replication mechanisms, however, we do not send periodic advertisements via the set of root nodes,
and members can transparently find the closest root given the root node identifier.
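
The self-organization of receivers around replicated roots can be sketched as follows. This is a
minimal illustration rather than the simulator's code: the names `nearest_root` and `partition` are
hypothetical, Euclidean distance between made-up coordinates stands in for network distance, and a
plain `min` stands in for Tapestry location, which finds the closest root only with high
probability.

```python
import math
from collections import defaultdict


def nearest_root(member_pos, roots):
    """Stand-in for Tapestry location: name of the root closest to `member_pos`."""
    return min(roots, key=lambda name: math.dist(member_pos, roots[name]))


def partition(members, roots):
    """Map each root name to the members whose JOIN message it would receive."""
    sets = defaultdict(list)
    for name, pos in members.items():
        sets[nearest_root(pos, roots)].append(name)
    return dict(sets)


# Hypothetical coordinates for two replicated roots and three joining members.
roots = {"R1": (0.0, 0.0), "R2": (10.0, 0.0)}
members = {"M1": (1.0, 1.0), "M2": (9.0, -1.0), "M3": (4.0, 0.0)}
print(partition(members, roots))
```

Each member ends up in exactly one receiver set, so the join load is split across roots without
any root advertising itself periodically to members.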

We evaluated our root replication algorithms by simulation. Our simulation results
on four topologies (AS, MBone, Transit-stub and TIERS) are quite similar. Here we only
show the Transit-stub results for clarity. We simulate a large multicast group that self-organizes into
membership partitions, and examine how replicated roots impact load balancing of membership
operations such as join. Figure 16 plots the mean and the 5th and 95th percentiles of the number
of join requests handled per root as members organize themselves around more replicated roots.

<BASE 4, NAMESPACE SIZE 4096, GROUP SIZE 4063>


Figure  16:  Membership  Message  Load  Balancing  by  Roots 


While  the  mean  number  of  requests  is  deterministic,  it  is  the  5th  and  95th  percentiles  which  show 
how  evenly  join  requests  are  load-balanced  between  different  replicated  roots.  As  the  number  of 
roots  increases,  the  variation  of  the  number  of  join  requests  handled  among  the  roots  decreases 
inversely,  showing  that  load-balancing  does  occur,  even  with  randomly  distributed  roots,  as  in  our 
simulation.  One  can  argue  that  real-life  network  administrators  can  do  much  better  by  intelligently 
placing  replicated  roots  to  evenly  distribute  the  load. 


6.2  Receiver  Identifier  Clustering 

To  further  reduce  packet  duplication,  Bayeux  introduces  the  notion  of  receiver  node  ID  clustering. 
Tapestry  delivery  of  Bayeux  packets  approaches  the  destination  ID  digit  by  digit,  and  one  single 
packet  is  forwarded  for  all  nodes  sharing  a  suffix.  Therefore,  a  naming  scheme  that  provides  an 
optimal  packet  duplication  tree  is  one  that  allows  local  nodes  to  share  the  longest  possible  suffix. 
For instance, in a Tapestry 4-digit hexadecimal naming scheme, a group of 16 nodes in a LAN
should be named by fixing the last 3 digits (XYZ), while assigning each node one of the 16 resulting
numbers (0XYZ, 1XYZ, 2XYZ, etc.). This means upstream routers delay packet duplication
until  reaching  the  LAN,  minimizing  bandwidth  consumption  and  reducing  link  stress.  Multiples  of 
these  16-node  groups  can  be  further  organized  into  larger  groups,  constructing  a  clustered  hierarchy. 
Figure  17  shows  such  an  example.  While  group  sizes  matching  the  Tapestry  ID  base  are  unlikely, 
clustered  receivers  of  any  size  will  show  similar  benefits.  Also  note  that  while  Tapestry  routing 
assumes  randomized  naming,  organized  naming  on  a  small  scale  will  not  impact  the  efficiency  of  a 
wide-area system.

Figure 17: Receiver ID Clustering according to network distance


<BASE 4, NAMESPACE SIZE 4096, GROUP SIZE 256, CLUSTER SIZE 16>

Figure 18: Worst case physical link stress vs. fraction of domains that use receiver ID clustering for
the transit-stub model


<BASE 4, NAMESPACE SIZE 4096, GROUP SIZE 256, TIERS 5000>

Figure 19: Maximum Reachability via Multiple Paths vs. Fraction of Failed Links in Physical
Network

To quantify the effect of clustered naming, we measured link stress versus the fraction of
local LANs that utilize clustered naming. We simulated 256 receivers on a Tapestry network using
an ID base of 4 and IDs of 6 digits. The simulated physical network is a transit-stub modeled
network of 50000 nodes, since it best represents the natural clustering properties of physical networks.
Receivers are organized as 16 local networks, each containing 16 members. Figure 18 shows the
dramatic decrease in worst case link stress as node names become more organized in the local area.
By correlating node proximity with naming, the duplication of a single source packet is delayed
until the local router, reducing bandwidth consumption at all previous hops. The result shows an
inverse relationship between worst case link stress and local clustering.
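
The effect of suffix sharing on duplication can be illustrated with the 4-digit hexadecimal example
from earlier in this section. The sketch below is an assumption-laden illustration, not the
simulator's `cluster.c`: the function name `copies_in_flight` and the concrete suffix "A3F" are
made up, and it counts only how many copies of one packet must exist after a given number of
trailing ID digits have been resolved.

```python
def copies_in_flight(receiver_ids, resolved_digits):
    """Count distinct suffixes of length `resolved_digits` among the receivers.

    Each distinct suffix needs its own packet copy once routing has
    resolved that many trailing digits of the destination ID.
    """
    if resolved_digits < 1:
        return 1  # nothing resolved yet: a single packet leaves the root
    return len({rid[-resolved_digits:] for rid in receiver_ids})


# 16 LAN nodes named 0A3F, 1A3F, ..., FA3F (shared 3-digit suffix "A3F").
clustered = [f"{d:X}A3F" for d in range(16)]
# 16 nodes whose names already differ in the last digit (no clustering).
scattered = [f"{d:X}" * 4 for d in range(16)]

for depth in range(1, 5):
    print(depth, copies_in_flight(clustered, depth), copies_in_flight(scattered, depth))
```

With clustered names a single copy travels until the final digit is resolved, whereas the
unclustered names force 16 copies from the first resolved digit onward.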


7  Fault-resilient  Packet  Delivery 


In this section, we examine how Bayeux leverages Tapestry's routing redundancy to maintain reliable
delivery despite node and link failures. Each entry in the Tapestry neighbor map maintains
secondary  neighbors  in  addition  to  the  closest  primary  neighbor.  In  Bayeux,  membership  state  is 
kept  consistent  across  Tapestry  nodes  in  the  primary  path  from  the  session  root  to  all  receivers. 
Routers  on  potential  backup  routes  branching  off  the  primary  path  do  not  keep  member  state.  When 
a  backup  route  is  taken,  the  node  where  the  branching  occurs  is  responsible  for  forwarding  on  the 
necessary  member  state  to  ensure  packet  delivery. 

We explore in this section approaches to exploit Tapestry's redundant routing paths for efficient
fault-resilient packet delivery, while minimizing the propagation of membership state among
Tapestry nodes. We first examine fault-resilient properties of the Tapestry hierarchical and redundant
routing paths, then present several possible protocols and some simulation results.

Figure 20: Average Hops Before Convergence vs. Position of Branch Point


7.1  Infrastructure  Properties 

A key feature of the Tapestry infrastructure is its backup routers per path at every routing hop.
Before examining specific protocols, we evaluate the maximum benefit such a routing structure can
provide. To this end, we used simulation to measure maximum connectivity based on Tapestry
multi-path routes. At each router, every outgoing logical hop maintains two backup pointers in
addition to the primary route.

Figure 19 shows maximum connectivity compared to IP routing. We used a topology generated
by the TIERS model consisting of 5000 nodes and 7084 links. Results are similar for other
topologies. We used a Tapestry node identifier namespace size of 4096, a base of 4, and a multicast
group size of 256 members. Links are randomly dropped, and we monitor the reachability of IP
and Tapestry routing. As link failures increase, region A shows the probability of successful IP and
Tapestry routing. Region C shows cases where IP fails and Tapestry succeeds. Region E represents
cases where the destination is physically unreachable. Finally, region B shows instances where IP
succeeds and Tapestry fails, and region D shows where both protocols fail to route to a reachable
destination. Note that regions B and D are almost invisible, since the multiple paths mechanism
in Tapestry finds a route to the destination with extremely high probability, if such a route exists.
This result shows that by using two backup pointers for each routing map entry, Tapestry achieves
near-optimal maximum connectivity.

Another notable property of the Tapestry routing infrastructure is its hierarchical nature [36]. All
possible routes to a destination can be characterized as paths up to a tree rooted at the destination.
With a random distribution of names, each additional hop decreases the expected number of next
hop candidates by a factor equal to the base of the Tapestry identifier. This property means that
with evenly distributed names, paths from different nodes to the same destination converge within
an expected number of hops equal to $\log_b(D)$, where $b$ is the Tapestry digit base, and $D$ is the
number of nodes between the two origin nodes in the network.

This convergent nature allows us to intentionally fork off duplicate packets onto alternate paths.
Recall that the alternate paths from a node are sorted in order of network proximity to it. The
expectation is that a primary next hop and a secondary next hop will not be too distant in the
network. Because the number of routers sharing the required suffix decreases quickly with each
additional hop, alternate paths are expected to quickly converge with the primary path. We confirm
this hypothesis via simulation in Figure 20. On a transit-stub topology of 5000 nodes, with Tapestry
IDs of base 4, where the point-to-point route has 6 logical hops, we see that convergence occurs very
quickly. As expected, an earlier branch point may incur more hops to convergence, and a secondary
route converges faster than a tertiary route.
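
The shrinking candidate set behind this convergence argument can be worked through numerically.
The sketch below simply evaluates the two expressions stated in the text under the assumption of
evenly distributed names; the function names are hypothetical and the node counts are taken from
the experiment setup above.

```python
import math


def expected_candidates(total_nodes, base, hop):
    """Expected number of nodes sharing a given suffix of `hop` digits."""
    return total_nodes / base ** hop


def expected_convergence_hops(base, nodes_between):
    """Expected hops before two paths converge: log of D to base b."""
    return math.log(nodes_between) / math.log(base)


# 5000 nodes and base 4, as in the Figure 20 setup.
for hop in range(7):
    print(hop, round(expected_candidates(5000, 4, hop), 1))

# Illustrative value of D: two origins separated by 256 nodes.
print(expected_convergence_hops(4, 256))
```

After six hops roughly one candidate remains out of 5000, which is consistent with the 6 logical
hops of the point-to-point route mentioned above.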


7.2  Fault-resilient  Delivery  Protocols 

We now examine more closely a set of Bayeux packet delivery protocols that leverage the redundant
route paths and hierarchical path reconvergence of Tapestry. While we list several protocols, we only
present simulation results for one, and continue to work on simulation and analysis of the others.
The protocols are presented in no particular order as follows:

1. Proactive Duplication: Each node forwarding data sends a duplicate of every packet to its
first backup route. Duplicate packets are marked, and routers on the secondary path cannot
duplicate them, and must forward them using their primary routers at each hop.

The hypothesis is that duplicates will all converge at the next hop, and duplication at each
hop means any single failure can be circumvented. While incurring a higher overhead, this
protocol also simplifies membership state propagation by limiting traffic to the primary paths
and first order secondary nodes. Membership state can be sent to these nodes before the
session. This protocol trades off additional bandwidth usage for circumventing single logical
hop failures.

2. Application-specific Duplicates: Similar to previous work leveraging application-specific
data distilling [20], this protocol is an enhancement to Proactive Duplication, where an
application-specific lossy duplicate is sent to the alternate link. In streaming multimedia,
the duplicate would be a reduction in quality in exchange for smaller packet size. This
provides the same single-failure resilience as protocol 1, with lower bandwidth overhead traded
off for quality degradation following packet loss on the primary path.

3. Prediction-based Selective Duplication: This protocol calls for nodes to exchange periodic
UDP probes with their next hop routers. Based on a moving history window of probe arrival
success rates and delay, a probability of successful delivery is assigned to each outgoing link,
and a consequent probability calculated for whether a packet should be sent via each link.
The weighted expected number of outgoing packets per hop can be varied to control the use
of redundancy (e.g., between 1 and 2).

When backup routes are taken, a copy of the membership state for the next hop is sent along
with the data once. This protocol incurs the overhead of periodic probe packets in exchange
for the ability to adapt quickly to transient congestion and failures at every hop.

4.  Explicit  Knowledge  Path  Selection:  This  protocol  calls  for  periodic  updates  to  each  node  from 
its  next  hop  routers  on  information  such  as  router  load/congestion  levels  and  instantaneous 
link  bandwidth  utilization.  Various  heuristics  can  be  employed  to  determine  a  probability 
function that chooses the best outgoing path for each packet. Packets are not duplicated.

5.  First  Reachable  Link  Selection:  This  protocol  is  a  relatively  simple  way  to  utilize  Tapestry’s 
routing  redundancy.  Like  the  previous  protocol,  a  node  receives  periodic  UDP  packets  from 
its  next  hop  routers.  Based  on  their  actual  and  expected  arrival  times,  the  node  can  construct 
a brief history window to predict short-term reliability on each outgoing route. Each incoming
data packet is sent on the shortest outgoing link that shows packet delivery success rate
(determined  by  the  history  window)  above  a  threshold.  No  packet  duplication  takes  place. 
When  a  packet  chooses  an  alternate  route,  membership  state  is  sent  along  with  the  data.  This 
protocol  is  discussed  more  in  Section  7.3. 
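
The link choice in the last protocol of this list can be sketched as follows. This is a minimal
illustration under stated assumptions, not the simulator's `protocol.c`: the class `LinkHistory`,
the function `first_reachable`, the window length and the threshold value are all hypothetical, and
routes are assumed to be given in order of network proximity (primary first).

```python
from collections import deque


class LinkHistory:
    """Sliding window of probe outcomes for one outgoing route."""

    def __init__(self, window=10):
        self.outcomes = deque(maxlen=window)  # oldest outcomes fall off automatically

    def record(self, arrived_on_time):
        self.outcomes.append(bool(arrived_on_time))

    def success_rate(self):
        if not self.outcomes:
            return 0.0  # no probes seen yet: treat the route as unproven
        return sum(self.outcomes) / len(self.outcomes)


def first_reachable(routes, threshold=0.8):
    """Pick the first route whose success rate is above `threshold`.

    `routes` is a list of (name, LinkHistory), primary first. Returns
    (name, is_backup), or None if no route qualifies. When is_backup is
    True, membership state would have to accompany the data.
    """
    for index, (name, history) in enumerate(routes):
        if history.success_rate() > threshold:
            return name, index > 0
    return None


primary, backup = LinkHistory(window=5), LinkHistory(window=5)
for ok in (True, False, False, False, False):
    primary.record(ok)
for ok in (True, True, True, True, True):
    backup.record(ok)
choice = first_reachable([("primary", primary), ("backup", backup)])
print(choice)
```

No packet is duplicated in this scheme; the only extra cost is the membership state sent when
`is_backup` is true.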

Note that several of these protocols (1, 2, 3) may send additional packets down secondary or
tertiary routes in addition to the original data. As we have shown in Figure 20, the bandwidth overhead
of those protocols is limited, since the duplicates quickly converge back on to the primary path, and
can be suppressed. This gives us the ability to route around single node or link failures. Duplicate
packet suppression can be done by identifying each packet with a sequential ID, and keeping track of
the  packets  expected  but  not  received  (in  the  form  of  a  moving  window)  at  each  router.  Once  either 
the  original  or  the  duplicate  packet  arrives,  it  is  marked  in  the  window,  and  the  window  boundary 
moves  if  appropriate.  All  packets  that  have  already  been  received  are  dropped. 
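
The duplicate suppression described above can be sketched as a small per-router filter. The sketch
is an illustration only: the class name `DuplicateFilter` is hypothetical, sequence IDs are assumed
to start at 0, and the window is left unbounded for simplicity, whereas a real router would cap
its size.

```python
class DuplicateFilter:
    """Tracks which sequence IDs have arrived, using a moving window boundary."""

    def __init__(self):
        self.next_expected = 0   # lowest sequence ID not yet received
        self.seen_ahead = set()  # IDs received beyond the window boundary

    def accept(self, seq):
        """Return True to forward the packet, False to drop it as a duplicate."""
        if seq < self.next_expected or seq in self.seen_ahead:
            return False
        self.seen_ahead.add(seq)
        # Move the boundary past any contiguous run of received IDs.
        while self.next_expected in self.seen_ahead:
            self.seen_ahead.remove(self.next_expected)
            self.next_expected += 1
        return True


f = DuplicateFilter()
arrivals = [0, 2, 2, 1, 0, 3]
print([f.accept(seq) for seq in arrivals])
```

Whichever copy of a packet arrives first, original or duplicate, is forwarded; the later copy is
dropped at the router where the alternate path rejoins the primary path.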


7.3  First  Reachable  Link  Selection 

Each of the above protocols has advantages and disadvantages, making them best suited for a variety
of different operating conditions. We present here our evaluation of First Reachable Link Selection
(FRLS), by first examining its probability of successful packet delivery, and then simulating the
increasing latency associated with sending membership state along with the data payload.

Figure 21 shows that FRLS delivers packets with very high success rate despite link failures.
The regions are marked similarly to those of Figure 19, where region A represents successful routing
by IP and Tapestry, region B is where IP succeeds and Tapestry fails, region C is where IP fails and
Tapestry succeeds, region D is where a possible route exists but neither IP nor Tapestry finds it, and
region E is where no path exists to the destination. When compared to Figure 19, we see that by
choosing a simple algorithm of taking the shortest predicted-success link, we gain almost all of the
potential fault-resiliency of the Tapestry multiple path routing. The end result is that FRLS delivers
packets with high reliability in the face of link failures.

<BASE 4, NAMESPACE SIZE 4096, GROUP SIZE 256, TIERS 5000>

Figure 21: Fault-resilient Packet Delivery using First Reachable Link Selection

Figure 22: Bandwidth Delay Due to Member State Exchange in FRLS


FRLS delivers packets with high reliability without packet duplication. The overhead comes in
the form of bandwidth used to pass along membership state to a session's backup routers. FRLS
keeps the membership state in each router on the primary path that the packets traverse. The size
of membership state transmitted decreases for routers that are further away from the data source
(multicast root). For example, a router with ID "475129" that is two hops away from the root
keeps a list of all members with Tapestry IDs ending in 29, while another router 420629 two
hops down the multicast tree will keep a list of all members with IDs ending in 0629. When a
backup route is taken and routing branches from the primary path, the router at the branching point
forwards the relevant portion of its own state to the branch taken, along with the data payload.
This causes a delay for the multicast data directly proportional to the size of member
state transmitted.
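
The way member state shrinks with distance from the root can be sketched using the router IDs from
the example above. The function name `member_state` and the member IDs are hypothetical; the only
assumption carried over from the text is that a router a given number of hops from the root tracks
the members whose IDs share that many trailing digits with its own ID.

```python
def member_state(router_id, hops_from_root, member_ids):
    """Members tracked by a router: those sharing its last `hops_from_root` digits."""
    if hops_from_root < 1:
        return list(member_ids)  # the root itself tracks every member
    suffix = router_id[-hops_from_root:]
    return [m for m in member_ids if m.endswith(suffix)]


# Made-up member IDs for illustration.
members = ["110629", "330629", "555529", "123456"]

print(member_state("475129", 2, members))  # two hops from the root: suffix "29"
print(member_state("420629", 4, members))  # two hops further down: suffix "0629"
```

The list held four hops down is a subset of the list held two hops down, so the state forwarded
at a late branch point is smaller than the state forwarded at an early one.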

We plot a simulation of average delivery latency in FRLS, including the member state
transmission delay, on a transit-stub 5000 node topology, using both base 4 and base 8 for Tapestry IDs.
Note that average time to delivery does not include unreachable nodes as failure rate increases.
Figure 22 shows that as link failures increase, delivery is delayed, but not dramatically. The standard
deviation is highest when link failures have forced half of the paths to resort to backup links, and it
spikes again as the number of reachable receivers drops and reduces the number of measured data
points.


8  Related  Work 

There are several projects that share the goal of providing the benefits of IP multicast without
requiring direct router support ([5, 6, 10, 14, 19, 23, 27]). End System Multicast [6] is one such example
targeted towards small-sized groups such as audio and video conferencing applications, where every
member in the group is a potential source of data. However, it does not scale to large-sized
multicast groups because every member needs to maintain a complete list of every other member in the
group. The Scattercast work by Chawathe et al. [5] is similar to the End System Multicast approach
except in the explicit use of infrastructure service agents, SCXs. Both Scattercast and End System
Multicast build a mesh structure across participating nodes, and then construct source-rooted trees
by running a standard routing protocol. On the other hand, Yallcast [10] directly builds a spanning
tree structure across the end hosts without any intermediate mesh structure, which requires
expensive loop detection mechanisms, and is also extremely vulnerable to partitions. The CAN multicast
work by Ratnasamy et al. [23] and the SCRIBE work by Rowstron et al. [27] are similar to Bayeux
in that they achieve scalability by leveraging the scalable routing infrastructure provided by systems
like CAN [22], Pastry [26], and Tapestry respectively. However, these systems have not focused on
fault-tolerant packet delivery as a primary goal.

In terms of the service model, EXPRESS [13] also adopts a source-specific paradigm, and
augments the multicast class D address with a unicast address of either the core or the sender.
This eliminates the address allocation problem and provides support for sender access control. In
contrast, Bayeux goes one step further and eliminates the class D address altogether. Using only the
UID and session name to identify the group makes it possible to provide additional features, such
as native incremental deployability, and load balancing at the root.

The idea of root replication shows a promising approach of providing anycast service at the
application level. Recently, IP-anycast has been proposed as an infrastructure service for multicast
routing. For example, Kim et al. use anycast to allow PIM-SM to support multiple rendezvous
points per multicast tree [15]. However, there is a lack of a globally deployed IP-anycast service.
There  are  several  proposals  for  providing  an  anycast  service  at  the  application  layer  ([4,  9,  11,  17, 
29]),  which  attempt  to  build  directory  systems  that  return  the  nearest  server  when  queried  with 
a  service  name  and  a  client  address.  Although  our  anycast  service  is  provided  at  the  application 
layer,  server  availability  is  discovered  by  local  Tapestry  nodes  and  updated  naturally  as  a  part  of  the 
Tapestry  routing  protocol.  Therefore,  our  mechanism  may  potentially  provide  an  anycast  service 
that  is  easier  to  deploy  than  IP-anycast,  yet  avoids  several  complications  and  scalability  problems 
associated  with  directory-based  application  layer  anycast.  We  believe  that  the  application  layer 
anycast  provided  by  the  Tapestry  overlay  network  described  herein  forms  an  interesting  topic  for 
future  research. 

Finally,  there  are  several  recent  projects  focusing  on  similar  goals  as  Tapestry.  Among  them  are 
Chord  [32]  from  MIT/Berkeley,  Content-Addressable  Networks  (CAN)  [22]  from  AT&T/ACIRI 
and Pastry [26] from Rice and Microsoft Research. These research projects have also produced
decentralized wide-area location and routing services with fault-tolerant properties, but only Tapestry
provides  explicit  correlation  between  overlay  distance  and  underlying  network  distance. 


9  Future  Work 

In this report, we have studied the properties of the First Reachable Link Selection (FRLS)
protocol. It will be worthwhile to explore and understand the performance and tradeoffs involved in the
alternative fault-resilient delivery protocols discussed in Section 7. In particular, it will be useful
to  look  at  the  effect  of  different  parameters  on  each  protocol,  and  their  performance  under  varying 
operating  conditions. 

The Streaming Media Systems Group at HP Labs has developed a multiple state video
encoder/decoder and a path diversity transmission system [1, 2], which sends different subsets of
packets over different paths. The multiple state video codec seems to fit well with our techniques for
duplicating packets onto alternate paths, and is an interesting area for future research.

Finally,  it  will  be  worthwhile  to  conduct  large  scale  Internet  experiments  with  emphasis  on 
studying  the  dynamics  of  Bayeux,  and  effects  of  packet  loss  and  cross-traffic. 


10  Conclusion 


In conclusion, we have presented an architecture for Internet content distribution that leverages
Tapestry, an existing fault-tolerant routing infrastructure. Simulation results show that Bayeux
achieves scalability, efficiency, and highly fault-resilient packet delivery. We believe Bayeux shows
that an efficient network protocol can be designed with simplicity while inheriting desirable
properties from an underlying application infrastructure.


A Appendix


In this appendix, we discuss the setup of our simulator used to carry out the experimental analysis
described earlier. We implemented Tapestry unicast routing and the Bayeux tree protocol by
extending the Stanford GraphBase library (SGB) [36], which is a platform for combinatorial computing.
The SGB library contains routines to manipulate graph structures, such as file formats, input/output
functions and shortest path calculations. We describe the various components of the simulator in
the following sections.

A.1 SGB Modification

gb_graph.w  9 extra vertex utility fields and 4 extra arc utility fields are added. This file needs to be
put into the SGB source code directory before installing SGB.

A.2  Generic  Functions 

cluster.c  functions that implement the Receiver Clustering scalability enhancement discussed in Section 6.2
fault.{c,h}  functions that inject link and node failures into the underlying physical network
graph.{c,h}  functions that interact with the SGB graph structures
hop.{c,h}  functions that measure routing delays
max_conn.c  functions that measure Maximum Reachability via Multiple Paths discussed in Section 7.1
nodeid.{c,h}  functions that implement various Tapestry node ID conversions
pick.{c,h}  functions that pick vertices from the Tapestry network
protocol.c  functions that implement the FRLS protocol discussed in Section 7.3
route.{c,h}  functions that build the Tapestry routing table for every node
stack.{c,h}  functions that implement the stack data structure
stat.{c,h}  functions that implement various statistic routines
stress.{c,h}  functions that measure stress on physical links
tree_partition.c  functions that implement the Tree Partitioning scalability enhancement discussed in Section 6.1
util.{c,h}  functions that implement various utility routines

A.3 Experimental Main Loop

exp.{c,h}  functions  that  implement  the  initialization  and  running  of  the  experiments 
main.c  functions  that  setup  and  run  the  experiments 

A.4  Post  Processing 

read.{c,h}  functions  that  read  delay  and  stress  values  for  post  processing 
post_proc.c  functions  that  post  process  experimental  results